{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# businesses\n",
    "bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(ch:wrangling_missing)=\n",
    "# Missing Values and Records\n",
    "\n",
    "In {numref}`Chapter %s <ch:theory_datadesign>`, we considered the potential\n",
    "problems when the population and the access frame are not in alignment,\n",
    "so we can't access everyone we want to study. We also described problems when\n",
    "someone refuses to participate in the study. In these cases, entire\n",
    "records/rows are missing, and we discussed the kinds of bias that can\n",
    "occur due to missing records. If nonrespondents differ in critical ways from\n",
    "respondents or if the nonresponse rate is not negligible, then our analysis may\n",
    "be seriously flawed.  The example in\n",
    "{numref}`Chapter %s <ch:theory_datadesign>` on election polls showed that increasing the sample size without\n",
    "addressing nonresponse does not reduce nonresponse bias. Also in that chapter,\n",
    "we discussed ways to prevent nonresponse. These preventive measures include\n",
    "using  incentives to encourage response, keeping surveys short, writing clear\n",
    "questions, training interviewers, and investing in extensive follow-up\n",
    "procedures. Unfortunately, despite these efforts, some amount of nonresponse is unavoidable. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When a record is not entirely missing, but a particular field in a record is\n",
    "unavailable, we have nonresponse at the field level. Some datasets use a\n",
    "special coding to signify that the information is missing.  We saw that the Mauna\n",
    "Loa data uses `-99.99` to indicate a missing CO<sub>2</sub> measurement. We found only seven of these\n",
    "values among 738 rows in the table.\n",
    "In this case, we showed that these missing values have little impact on the analysis."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The values for a feature are called *missing completely at random* when those records with the missing data are like a randomly chosen subset of records. That is, whether or not a record has a missing value does not depend on the unobserved \n",
    "feature, the values of other features, or the sampling design. \n",
    "For example, if someone accidentally breaks the laboratory equipment at Manua Loa and CO<sub>2</sub> is not\n",
    "recorded for a day, there is no reason to think that the level of CO<sub>2</sub> that day\n",
    "had something to do with the lost measurements.\n",
    "\n",
    "At other times, we consider values *missing at random given covariates* (covariates are other features in the dataset).\n",
    "For example, the type of an ER visit in the DAWN survey is missing at random given\n",
    "covariates if, say, the nonresponse depends only on race and sex (and not on the type of visit or anything else). \n",
    "In these limited cases, the observed data can be weighted to accommodate for nonresponse."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In some surveys, missing information is further categorized as to whether the\n",
    "respondent refused to answer, the respondent was unsure of the answer, or the interviewer\n",
    "didn't ask the question. Each of these types of missing values is recorded\n",
    "using a different value. For example, according to the [codebook](https://www.datafiles.samhsa.gov/sites/default/files/field-uploads-protected/studies/DAWN-2010/DAWN-2010-datasets/DAWN-2010-DS0001/DAWN-2010-DS0001-info/DAWN-2010-DS0001-info-codebook.pdf), many questions in the DAWN survey use a\n",
    "code of `-7` for not applicable, `-8` for not documented, and `-9` for missing. \n",
    "Codings such as these can help us further refine our study of nonresponse. \n",
    "\n",
    "After nonresponse has occurred, it is sometimes possible to use models to\n",
    "predict the missing data. We describe this process next. \n",
    "But remember, predicting missing observations is never as good\n",
    "as observing them in the first place."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At times, we substitute a reasonable value for a missing one to create a \"clean\"\n",
    "dataframe.  This process is called *imputation*. Some common approaches for\n",
    "imputing values are *deductive*, *mean*, and *hot-deck* imputation.\n",
    "\n",
    "In deductive imputation, we fill in a value through logical relationships with other features. \n",
    "For example, here is a row in the business dataframe for San Francisco restaurant\n",
    "inspections. The zip code is erroneously marked as \"Ca\" and latitude and\n",
    "longitude are missing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>business_id</th>\n",
       "      <th>name</th>\n",
       "      <th>address</th>\n",
       "      <th>city</th>\n",
       "      <th>...</th>\n",
       "      <th>postal_code</th>\n",
       "      <th>latitude</th>\n",
       "      <th>longitude</th>\n",
       "      <th>phone_number</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5480</th>\n",
       "      <td>88139</td>\n",
       "      <td>TACOLICIOUS</td>\n",
       "      <td>2250 CHESTNUT ST</td>\n",
       "      <td>San Francisco</td>\n",
       "      <td>...</td>\n",
       "      <td>Ca</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>+14156496077</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1 rows × 9 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      business_id         name            address           city  ...  \\\n",
       "5480        88139  TACOLICIOUS  2250 CHESTNUT ST   San Francisco  ...   \n",
       "\n",
       "     postal_code latitude  longitude  phone_number  \n",
       "5480          Ca      NaN        NaN  +14156496077  \n",
       "\n",
       "[1 rows x 9 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bus[bus['postal_code'] == \"Ca\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can look up the address on the USPS website to get\n",
    "the correct zip code, and we can use Google Maps to find the latitude and\n",
    "longitude of the restaurant to fill in these missing values."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mean imputation uses an average value from rows in the dataset that aren't missing.\n",
    "As a simple example, if a dataset on test scores is missing scores for some students,\n",
    "mean imputation would fill in the missing value using the mean of the nonmissing scores.\n",
    "A key issue with mean imputation is that the variability in the imputed\n",
    "feature will be smaller because the feature now has values that are identical\n",
    "to the mean. This affects later analysis if not handled properly---for\n",
    "instance, confidence intervals will be smaller than they should be (these topics are covered in {numref}`Chapter %s <ch:inf_pred_theory>`).\n",
    "The missing values for CO<sub>2</sub> in Mauna Loa used a more sophisticated averaging technique that included neighboring seasonal values. \n",
    "\n",
    "Hot-deck imputation uses a chance process to select a value at random from rows\n",
    "that have values. As a simple example, hot-deck imputation could fill in\n",
    "missing test scores by randomly choosing another test score in the dataset.\n",
    "A potential problem with hot-deck imputation is that the strength of a\n",
    "relationship between the features might weaken because we have added randomness.\n",
    "\n",
    "For mean and hot-deck imputation, we often impute values based on other records in the\n",
    "dataset that have similar values in other features. \n",
    "More sophisticated imputation techniques use nearest-neighbor methods to find\n",
    "similar subgroups of records and others use regression techniques to predict\n",
    "the missing value.\n",
    "\n",
    "With all of these types of imputation, we should create a new feature that\n",
    "contains the altered data or a new feature to indicate whether or not the\n",
    "response in the original feature has been imputed so that we can track our changes."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Decisions to keep or drop a record with a missing value, to change a value, or to remove a feature\n",
    "may seem small, but they can be critical. One anomalous record can seriously\n",
    "impact your findings. Whatever you decide, be sure to check the impact of\n",
    "dropping or changing features and records. And be transparent and thorough in\n",
    "reporting any modifications you make to the data. It's best to make these\n",
    "changes programmatically to reduce potential errors and enable others to\n",
    "confirm exactly what you have done by reviewing your code."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same transparency and reproducible precautions hold for data transformations, which we discuss next."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
